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Abstract 

Persistent homology (PH) is a method used in topological data analysis (TDA) to study 
qualitative features of data that persist across multiple scales. It is robust to perturba¬ 
tions of input data, independent of dimensions and coordinates, and provides a compact 
representation of the qualitative features of the input. There has been recent progress, but 
the computation of PH remains an open area with numerous important and fascinating 
challenges. The held of PH computation is evolving rapidly, and new algorithms and soft¬ 
ware implementations are being updated and released at a rapid pace. The purposes of 
our article are to (1) introduce theory and computational methods for PH to a broad range 
of computational scientists and (2) provide benchmarks of state-of-the-art implementations 
for the computation of PH. We give a friendly introduction to PH, navigate the pipeline 
for the computation of PH with an eye towards applications, and use a range of synthetic 
and real-world data sets to evaluate currently available open-source implementations for the 
computation of PH. Based on our benchmarking, we indicate which algorithms and imple¬ 
mentations are best suited to different types of data sets. In an accompanying tutorial, we 
provide guidelines for the computation of PH. We make publicly available all scripts that 
we wrote for the tutorial, and we make available the processed version of the data sets used 
in the benchmarking. 


1 Introduction 


The amount of available data has increased dramatically in recent years, and this situation 
— which will only become more extreme — necessitates the development of innovative and 
efficient data-processing methods. Making sense of the vast amount of data is difficult: on one 
hand, the sheer size of the data poses challenges; on the other hand, the complexity of the 
data, which includes situations in which data is noisy, high-dimensional, and/or incomplete, is 
perhaps an even more significant challenge. The use of clustering techniques and other ideas 
from areas such as computer science, machine learning, and uncertainty quantification — along 
with mathematical and statistical models — are often very useful for data analysis (see, e.g., 
60,64 78 112 and many other references). However, recent mathematical developments are 
shedding new light on such “traditional” ideas, forging new approaches of their own, and helping 
people to better decipher increasingly complicated structure in data. 

Techniques from the relatively new discipline of “topological data analysis” (TDA) have 
provided a wealth of new insights in the study of data in an increasingly diverse set of appli¬ 
cations — including coverage in sensor networks 39 , protein structure 59 80 130|, stability of 
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ries 


fullerene molecules 129 , robotics ^ 109 123 , signals in images 30 68 , periodicity in time se- 


106 , breast-cancer classification [43||103 , viral evolution 26 , natural images 24 , the spread 


of contagions 90||119 , the structure of amorphous materials 70 , force networks in granular mat 
81,82 , equities-market networks 87 , diverse applications in neuroscience 35 63 , collective 


ter 

behavior in biology 121 , and time-series output of dynamical systems 91 . There are numer¬ 


ous others, and new applications of TDA appear in journals and preprint servers increasingly 
frequently. There are also interesting computational efforts, such as 


127 


TDA is a field that lies at the intersection of data analysis, algebraic topology, computational 
geometry, computer science, statistics, and other related areas. The main goal of TDA is to use 
ideas and results from geometry and topology to develop tools for studying qualitative features of 
data. To achieve this goal, one needs precise definitions of qualitative features, tools to compute 
them in practice, and some guarantee about the robustness of those features. One way to address 
all three points is a method in TDA called persistent homology (PH). This method is appealing for 
applications because it is based on algebraic topology, which gives a well-understood theoretical 
framework to study qualitative features of data with complex structure, is computable via linear 
algebra, and is robust with respect to small perturbations in input data. 

Types of data sets that can be studied with PH include finite metric spaces, digital images, 
level-sets of real-valued functions, and networks (see Section 5.1). In the next two paragraphs, 
we give some motivation for the main ideas of persistent homology by discussing two examples 
of such data sets. 

Finite metric spaces are also called “point cloud” data sets in the TDA literature. From a 
topological point of view, finite metric spaces do not contain any interesting information. One 
thus considers a “thickening” of a point cloud at different “scales of resolution” and then analyses 
the evolution of the resulting shape across the different resolution scales. The qualitative features 
are given by topological invariants, and one can represent the variation of such invariants across 
the different resolution scales in a compact way to summarize the “shape” of the data. 

As an illustration, consider the set of points in that we show in Fig. Let e, which we 
interpret as a “distance parameter,” be a nonnegative real number (so e = 0 gives the set of 
points). For different values of e, we construct a space Se composed of vertices, edges, triangles, 
and higher-dimensional polytopes according to the following rule: We include an edge between 
two points i and j if and only if the Euclidean distance between them is no larger than e; we 
include a triangle if and only if all of its edges are in S^', we include a tetrahedron if and only 
if all of its face triangles are in and so on. For e < e', it then follows that the space is 
contained in the space ■ This yields a nested sequence of spaces, as we illustrate in Fig. [^a). 


By using homology, a tool in algebraic topology, one can measure several features of the 
spaces Se — including the numbers of components, holes, and voids (higher-dimensional versions 
of holes). One can then represent the lifetime of such features using a finite collection of intervals 
known as a “barcode.” Roughly, the left endpoint of an interval represents the birth of a feature, 
and its right endpoint represents the death of the same feature. In Fig. Sb), we reproduce such 
intervals for the number of components (blue solid lines) and the number of holes (violet dashed 
lines). 

In Fig.Qb), we observe a dashed line that is significantly longer than the other dashed lines. 
This indicates that the data set has a long-lived hole. By contrast, one can potentially construe 
the shorter dashed lines as nois^ When a feature is still “alive” at the largest value of e that we 
consider, the lifetime interval is an infinite interval, which we indicate by putting an arrowhead 


^However, note that shorter lines cannot always be construed as noise (see, e.g.. 


117 ). 
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Figure 1: (a) A finite set of points in (for e = 0) and a nested sequence of spaces obtained from 
it (from e = 0 to e = 2.1). (b) Barcode for the nested sequence of spaces illustrated in (a). Solid lines 
represent the lifetime of components, and dashed lines represent the lifetime of holes. 


at the right endpoint of the interval. In Fig.[^b), we see that there is exactly one solid line that 
lives up to e = 2.1. One can use information about shorter solid lines to extract information 
about how data is clustered in a similar way as with linkage-clustering methods 60 . Note that 


one of the most difficult parts of using PH is the statistical interpretation of results. We discuss 
such interpretation further in Section |5.4| Our construction of nested spaces gives an example 
of a “filtered Vietoris-Rips complex,” which we define and discuss in Section |5.2[ 

We now discuss a second example related to digital images. (For an illustration, see Fig.j^a).) 
Digital images have a “cubical” structure, given by the pixels (for 2-dimensional digital images) 
or voxels (for 3-dimensional images). Therefore, one approach to study digital images uses 
combinatorial structures called “cubical complexes.” (For a different approach to the study of 
digital images, see Section [KT| ) Roughly, cubical complexes are topological spaces built from 
a union of vertices, edges, squares, cubes, and higher-dimensional hypercubes. An efficient 
way 126 to build a cubical complex from a 2-dimensional digital image consists of assigning 
a vertex to every pixel, then joining vertices corresponding to adjacent pixels by an edge, and 
filling in the resulting squares. . One proceeds in a similar way for 3-dimensional images. 
One then labels every vertex with an integer that corresponds to the gray value of the pixel, 
and one labels edges (respectively, squares) with the maximum of the values of the adjacent 
vertices (respectively, edges). One can then construct a nested sequence of cubical complexes 
Co C Cl C • • • C C 256 , where for each i S {0,1,..., 256}, the cubical complex Ci contains all 
vertices, edges, squares, and cubes that are labeled by a number less than or equal to i. (See 
Fig-i: c) for an example.) Such a sequence of cubical complexes is also called a “filtered cubical 
complex.” Similar to the previous example, one can use homology to measure several features 
of the spaces Ci (see Fig.j^d)). 

In the present article, we focus on persistent homology, but there are also other methods in 
TDA — including the Mapper algorithm 114 , Euler calculus (see for an introduction with 
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Figure 2: (a) A gray-scale image, (b) the matrix of gray values, (c) the filtered cubical complex 

associated to the digital image, and (d) the barcode for the nested sequence of spaces in panel (c). A 
solid line represents the lifetime of a component, and a dashed line represents the lifetime of a hole. 
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an eye towards applications), cellular sheaves 34|[^ , and many more. We refer readers who wish 
to learn more about the foundations of TDA to the article 22 , which discusses why topology 


and functoriality are essential for data analysis. We point to several introductory papers, books, 
and two videos on PH at the end of Section ID 

The first algorithm for the computation of PH was introduced for computation over F 2 (the 
field with two elements) in 53 and over general fields in 137 . Since then, several algorithms and 


optimization techniques have been presented, and there are now various powerful implementa¬ 
tions of PH [5ppp^||97p00pl8 . Those wishing to try PH for computations may hnd it difficult 


to discern which implementations and algorithms are best suited for a given task. The field of PH 
is evolving continually, and new software implementations and updates are released at a rapid 
pace. Not all of them are well-documented, and (as is well-known in the TDA community), the 
computation of PH for large data sets is computationally very expensive. 

To our knowledge, there exists neither an overview of the various computational methods for 
PH nor a comprehensive benchmarking of the state-of-the-art implementations for the computa¬ 
tion of persistent homology. In the present article, we close this gap: we introduce computation 
of PH to a general audience of applied mathematicians and computational scientists, offer guide¬ 
lines for the computation of PH, and test the existing open-source published libraries for the 
computation of PH. 

The rest of our paper is organized as follows. In Section we discuss related work. We then 
introduce homology in Section and introduce PH in Section We discuss the various steps 
of the pipeline for the computation of PH in Section and we briefly examine algorithms for 
“generalized persistence” in Section In Section we give an overview of software libraries, 
discuss our benchmarking of a collection of them, and provide guidelines for which software or 
algorithm is better suited to which data set. (We provide specific guidelines for the computation 
of PH with the different libraries in the Tutorial in the Supplementary Information (SI).) In 
Section we discuss future directions for the computation of PH. 


2 Related work 


In our work, we introduce PH to non-experts with an eye towards applications, and we benchmark 
state-of-the-art libraries for the computation of PH. In this section, we discuss related work for 
both of these points. 

There are several excellent introductions to the theory of PH (see the references at the end 
of Section 4.1 ), but none of them emphasizes the actual computation of PH by providing specihc 
guidelines for people who want to do computations. In the present paper, we navigate the 
theory of PH with an eye towards applications, and we provide guidelines for the computation 
of PH using the open-source libraries Perseus, Dionysus, DIPHA, javaPlex, and Gudhi. 
We include a tutorial (see the SI) that gives specific instructions for how to use the different 
functionalities that are implemented in these libraries. Much of this information is scattered 
throughout numerous different papers, websites, and even source code of implementations, and 
we believe that it is beneficial to the applied mathematics community (especially people who 
seek an entry point into PH) to find all of this information in one place. The functionalities 
that we cover include plots of barcodes and persistence diagrams and the computation of PH 
with Vietoris-Rips complexes, alpha complexes, Cech complexes, witness complexes, and cubical 
complexes for image data. We thus believe that our paper closes a gap in introducing PH to people 
interested in applications, while our tutorial complements existing tutorials (see, e.g. [2 19 ^). 

We believe that there is a need for a thorough benchmarking of the state-of-the-art libraries. 
In our work, we use twelve different data sets to test and compare the libraries javaPlex, 
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compares packages from different authors. However, Maria compares only up to three different 
implementations at one time, and he used the package jPlex (which is no longer maintained) 
instead of the javaPlex library (its successor). Additionally, the widely used library Perseus 
(e.g., it was used in ^82 115, 119|) does not appear in Maria’s benchmarking. 


3 Homology 


Assume that one is given data that lies in a metric space, such as a subset of Euclidean space 
with an inherited distance function. In many situations, one is not interested in the precise 
geometry of these spaces, but instead seeks to understand some basic characteristics, such as the 
number of components or the existence of holes and voids. Algebraic topology captures these 
basic characteristics either by counting them or by associating vector spaces or more sophisti¬ 
cated algebraic structures to them. Here we are interested in homology, which associates one 
vector space Hi{X) to a space X for each natural number i G {0,1,2,...}. The dimension of 
Hq{X) counts the number of path components in X, the dimension of Hi{X) is a count of the 
number of holes, and the dimension of H 2 {X) is a count of the number of voids. An important 
property of these algebraic structures is that they are robust, as they do not change when the 
underlying space is transformed by bending, stretching, and other deformations. In technical 
terms, they are homotopy invariant. Conversely, under favorable condition^ these algebraic 
invariants determine the topology of a space up to homotopy — an equivalence relation that is 
much coarser (and easier to work with) than the more familiar notion of homeomorphy. 

It can be very difficult to compute the homology of arbitrary topological spaces. We thus 
approximate our spaces by combinatorial structures called “simplicial complexes,” for which 
homology can be easily computed algorithmically. Indeed, often one is not even given the space 
X, but instead possesses only a discrete sample set S from which to build a simplicial complex 


following one of the recipes described in Sections 3.2 and 5.2 


3.1 Simplicial complexes and their homology 

We begin by giving the definitions of simplicial complexes and of the maps between them. 
Roughly, a simplicial complex is a space that is built from a union of points, edges, trian¬ 
gles, tetrahedra, and higher-dimensional polytopes. We illustrate the main definitions given in 
this section with the example in Fig. As we pointed out in Section “cubical complexes” 
give another way to associate a combinatorial structure to a topological space. In TDA, cubical 
complexes have been used primarily to study image data sets. One can compute PH for a nested 
sequence of cubical complexes in a similar way as for simplicial complexes, but the theory of 
PH for simplicial complexes is richer, and we therefore examine only simplicial homology and 
complexes in our discussions. 


^See [69[ Corollary 4.33]. 















7 


Definition 1 A simplicial complejj^ is a collection K of non-empty subsets of a set Kq such 
that T <Z a and a G K guarantees that t € K and {?;} € K for all v € Kq. The elements of 
Kq are called vertices of K , and the elements of K are called simplices. Additionally, we say 
that a simplex has dimension p or is a p-simplex if it has a cardinality of p -\- 1. We use Kp 
to denote the collection of p-simplices. The fc-skeleton of K is the union of the sets Kp for all 
p € {0,1,..., k}. If T and a are simplices such that t G a, then we call r a face of a. The 
dimension of K is defined as the maximum of the dimensions of its simplices. A map of simplicial 
complexes, f : K ^ L, is a map f: Kq — >■ Lq such that /(cr) G L for all a € K. 


We give an example of a simplicial complex in Fig. l^[a) and an example of a map of simplicial 


complexes in Fig. 1 b) Definition]^ is rather abstract, but one can always interpret a finite sim¬ 
plicial complex K geometrically as a subset of for sufficiently large N ; such a subset is called 
a “geometric realization,” and it is unique up to a canonical piecewise-linear homeomorphism. 
For example, the simplicial complex in Fig. l^[a) has a geometric realization given by the subset 
of in Fig. . 


(a) A simplicial complex: 

K = {{a}, {b}, {c}, {4, {e}, {a, b}, {a, c}, {a, d}, {b, c}, {c, d} , {a, b, c}} . 

(b) Let K be the simplicial complex in |(a)[ and let K' be the following simplicial complex: 

K' = {{ 4 , {h}, {1}, {g, h}, {g, 1}, {h, 1}, {g, h, 1}} . 

We define a map of simplicial complexes, /: AT —>■ AT', as follows: 


/ • Kq —>■ Kq : X I— 


h , if x=a. 
g, otherwise. 


(c) Geometric realizations of the simplicial complexes in [R] and m are the following 
subsets of : 


Figure 3: (a) Example of a simplicial complex, (b) a map of simplicial complexes, and (c) a geometric 
realization of the simplicial complex in (a). 

We now define homology for simplicial complexes. Let F 2 denote the field with two elements. 
Given a simplicial complex AT, let Cp{K) denote the F 2 -vector space with basis given by the 
p-simplices of AT. For any p G {1,2,...}, we define the linear map 

dp: Cp{K) —> Cp-i{K): a 1 —>■ ^ ^ t . 

TCCT.TdKp-i 

For p = 0, we define do to be the zero map. In words, dp maps each p-simplex to its boundary, 
the sum of its faces of codimension 1. Because the boundary of a boundary is always empty, the 


®Note that this is usually called an “abstract simplicial complex” in the literature. 
























linear maps dp have the property that composing any two consecutive maps yields the zero map: 
for all p G {0,1,2,...}, we have dp o dp+i = 0. Consequently, the image of dp+i is contained in 
the kernel of dp, so we can take the quotient of kernel((ip) by image((ip+i). We can thus make 
the following definition. 


Definition 2 For any p G {0,1,2,...}, the pth homology of a simplicial complex K is the 
quotient vector space 

Hp{K) := kernel(dp) / image(dp+i). 

Its dimension 

Pp{K) := dimHp{K) = dimkernel(dp) — dimimage(dp+i) 

is called the pth Betti number of K. Elements in the image of dp+i are called p-boundaries, and 
elements in the kernel of dp are called p-cycles. 

Intuitively, the p-cycles that are not boundaries represent p-dimensional holes. Therefore, the 
pth Betti number “counts” the number of p-holes. Additionally, if AT is a simplicial complex of 
dimension n, then for all p > n, we have that Hp{K) — 0, as Kp is empty and hence Cp{K) = 0. 
We therefore obtain the following sequence of vector spaces and linear maps: 


0 


CniK) 


A Ci{K) A Co{K) A 0 . 


We give an example of such a sequence in Fig. "{[a) for which we also report the Betti numbers. 

One of the most important properties of simplicial homology is “functoriality.” Any map 
/: AT —>■ AT' of simplicial complexes induces the following F 2 -linear map: 

Jp-.Cp{K)^Cp{K')-. Y. Co-O’ I—t 'y ' Co-/(o’), for any p S {0,1,2 ,... } , 

(J^Kp cr£Kp such that f{(T)£K' 


where c^ G F 2 . Additionally, fp o dp+i = dp+i o /p+i, and the map fp therefore induces the 
following linear map between homology vector spaces: 

fp-. Hp{K) ^ Hp{K'), 

[c] ^ [/p(c)] . 


(We give an example of such a map in Fig. m ) Consequently, to any map f: K ^ K' oi 
simplicial complexes, we can assign a map fp: Hp{K) -G Hp{K') for any p G (0,1,2,...}. This 
assignment has the important property that given a pair of composeable maps of simplicial 
complexes, /: AT —>■ K' and g-. K' —AT", the map {go f)p'. Hp{K) -G Hp{K") is equal to the 
composition of the maps induced by / and g. That is, {g o f)p = gpO fp. The fact that a map of 
simplicial complexes induces a map on homology that is compatible with composition is called 


functoriality, and it is crucial for the definition of persistent homology (see Section 4.11. 


When working with simplicial complexes, one can modify a simplicial complex by removing 
or adding a pair of simplices (cr, r), where r is a codimension-1 face of a and a is the only 
simplex that has r as a face. The resulting simplicial complex has the same homology as the one 
with which we started. In Fig. (a), we can remove the pair ({a, b, c}, {b, c}) and then the pair 
({a, b}, {6}) without changing the Betti numbers. Such a move is called an elementary simplicial 
collapse [31| . In Section 5.2.6[ we will see an application of this for the computation of PH. 

In this section, we have defined simplicial homology over the field F 2 — i.e., “with coefficients 
in F 2 .” One can be more general and instead define simplicial homology with coefficients in any 
field (or even in the integers). However, when 1^—1, one needs to take more care when defining 
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(a) We compute the simplicial homology for the simplicial complex in Fig. |l|[a)[ We have 
the following sequence of vector spaces and linear maps: 


0 


Fo 


^2. 




^1. 




do 


0 . 


Let abc denote the basis vector that corresponds to the simplex {a, 6, c}. Similarly, 
we use ab, ac, ad, be, and cd to denote the basis vectors that correspond to the 1- 
simplices; and we use a, b, c, d, and e to denote the basis vectors that correspond to 
the 0-simplices. We order the bases of the vector spaces using lexicographic order. We 
then have 


d2 = {I 1 0 1 0 ) 


and 


di — 


/I 

1 

0 

0 

VO 


0\ 

0 

1 

1 

0 / 


One can then compute that l3o{K) = 2, f3i{K) = 1, and all higher Betti numbers are 

0 . 


(b) The map of simplicial complexes, 
Jo: Co{K) ^ Co(X' 


f: K ^ K', in Fig. [fb) 


induces the map 
on chain complexes that sends the basis element a to h (note 


that we use the same notation as in part |(a)| of this box). Furthermore, it sends the 
basis elements that correspond to the other 0-simplices of K to g, and it induces the 
map /i: Ci{K) —>• Ci{K') that sends the basis elements ab, ac, and ad to cd and sends 
the basis elements cd and &c to 0. This gives a map /o on degree-0 homology that sends 
both generators of Ho{K) to the generator of Hq{K'). It follows that Hi{K') = 0 for 
all z > 1, so fi is the zero map for z > 1. 


Figure 4: (a) Computation of simplicial homology for the simplicial complex in Fig. 

induced map in degree-0 homology for the map of simplicial complexes in Fig. 


b) 


a) and (b) 


the boundary maps dp to ensure that dpodp^i remains the zero map. Consequently, the definition 
is more involved. For the purposes of the present paper, it suffices to consider homology with 
coefficients in the held F 2 . Indeed, we will see in Section]^ that to obtain topological summaries 
in the form of barcodes, we need to compute homology with coefficients in a held. Furthermore, 


as we summarize in Table! 
of PH work with F 2 . See [( 
computational homology. 


(in Section , most of the implementations for the computation 
1 ] for an introduction to homology, and see 76 for an overview of 


3.2 Building simplicial complexes 

As we discussed in Section |3.I[ computing the homology of hnite simplicial complexes boils 
down to linear algebra. The same is not true for the homology of an arbitrary space X, and one 
therefore tries to hnd simplicial complexes whose homology approximates the homology of the 
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space in an appropriate sense. 

An important tool is the Cech (C) complex. Let U he a. cover of A — i.e., a collection of 
subsets of X such that the union of the subsets is X. The fc-simplices of the Cech complex are 
the non-empty intersections of A: -I- 1 sets in the cover lA. More precisely, we define the nerve of 
a collection of sets as follows. 


Definition 3 LetlA = be a non-empty collection of sets. The nerve ofU is the simplicial 

complex with set of vertices given by I and k-simpliees given by {io, ■ ■ • j if and only if Ui^ n 

• • • n U,, ^ 0. 


If the cover of the sets is sufficiently “nice,” then the Nerve Theorem implies that the nerve 
of the cover and the space X have the same homology 


13,51 


For example, suppose that we 
have a finite set of points ^ in a metric space X. We then can define, for every e > 0, the space 
Se as the union UxesB{x,e), where B{x,e) denotes the closed ball with radius e centered at x. 
It follows that {i3(a;, e) I a; G S'} is a cover of Se, and the nerve of this cover is the Cech complex 
on S at scale e. We denote this complex by Ce(S). If the space X is Euclidean space, then the 
Nerve Theorem guarantees that the simplicial complex C'e(S) recovers the homology of Sg. 

From a computational point of view, the Cech complex is expensive because one has to 
check for large numbers of intersections. Additionally, in the worst case, the Cech complex can 
have dimension \hl\ — 1, and it therefore can have many simplices in dimensions higher than 
the dimension of the underlying space. Ideally, it is desirable to construct simplicial complexes 
that approximate the homology of a space but are easy to compute and have “few” simplices, 
especially in high dimensions. This is a subject of ongoing research: In Subsection |5.2[ we give 
an overview of state-of-the-art methods to asso ciate c omplexes to point-cloud data in a way that 
addresses one or both of these desiderata. See |5l||56| for more details on the Cech complex, and 
see 


13,51 for a precise statement of the Nerve Theorem. 


4 Persistent homology 

Assume that we are given experimental data in the form of a finite metric space S'; there are 
points or vectors that represent measurements along with some distance function (e.g., given 
by a correlation or a measure of dissimilarity) on the set of points or vectors. Whether or not 
the set S is a sample from some underlying topological space, it is useful to think of it in those 
terms. Our goal is to recover the properties of such an underlying space in a way that is robust 
to small perturbations in the data S. In a broad sense, this is the subject of topological inference. 
(See |104| for an overview.) If S' is a subset of Euclidean space, one can consider a “thickening” Sg 
of S given by the union of balls of a certain fixed radius e around its points and then compute the 
Cech complex. One can thus try to compute qualitative features of the data set S by constructing 
the Cech complex for a chosen value e and then computing its simplicial homology. The problem 
with this approach is that there is a priori no clear choice for the value of the parameter e. The 
key insight of PH is the following: To extract qualitative information from data, one considers 
several (or even all) possible values of the parameter e. As the value of e increases, simplices are 
added to the complexes. Persistent homology then captures how the homology of the complexes 
changes as the parameter value increases, and it detects which features “persist” across changes 
in the parameter value. We give an example of persistent homology in Fig. 

4.1 Filtered complexes and homology 

Let AT be a finite simplicial complex, and let Ki C K 2 C ■■■ C Ki = K he a finite sequence of 
nested subcomplexes of K. The simplicial complex K with such a sequence of subcomplexes is 
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called a filtered simplicial complex. See Fig. j^a) for an example of filtered simplicial complex. 
We can apply homology to each of the subcomplexes. For all homology degrees p, the inclusion 
maps Ki —>■ Kj induce F 2 -linear maps fij : Hp(Ki) —> Hp{Kj) for all j S {1, ■ ■ ■ ,1} with i < j. 


By functoriality (see Section 3.11, it follows that 


fk,j O fi,k — fi,j 

We therefore give the following definitior0 


for all i < k < j. 


( 1 ) 


Definition 4 Let Ki C K 2 G ■ ■ ■ G Ki = K be a filtered simplicial complex. The pth persistent 
homology of K is the pair 

{{Hpim l<i<l ’ > 

where for all i,j G {1 ,... ,1} with i < j, the linear maps fij : Hp{Ki) -G Hp{Kj) are the maps 
induced by the inclusion maps Ki -G Kj. 


The pth persistent homology of a filtered simplicial complex gives more refined information 
than just the homology of the single subcomplexes. We can visualize the information given by 
the vector spaces Hp{Ki) together with the linear maps fij by drawing the following diagram: at 
filtration step i, we draw as many bullets as the dimension of the vector space Hp{Ki). We then 
connect the bullets as follows: we draw an interval between bullet u at filtration step i and bullet 
V at filtration step i + 1 if the generator of Hp{Ki) that corresponds to u is sent to the generator 
of Hp{Ki+x) that corresponds to v. If the generator corresponding to a bullet u at filtration step 
i is sent to 0 by /iy+i, we draw an interval starting at u and ending at * + l. (See Fig.j^b) for an 
example.) Such a diagram clearly depends on a choice of basis for the vector spaces Hp(Ki), and 
a poor choice can lead to complicated and unreadable clutter. Fortunately, by the Fundamental 
Theorem of Persistent Homology 137 , there is a choice of basis vectors of Hp{Ki) for each 
i G {!,...,?} such that one can construct the diagram as a well-defined and unique collection 
of disjoint half-open intervals, collectively called a barcod^ We give an example of a barcode 
in Fig. I^c). Note that the Fundamental Theorem of PH, and hence the existence of a barcode, 
relies on the fact that we are using homology with field coefficients. (See |137| for more details.) 

There is a useful interpretation of barcodes in terms of births and deaths of generators. 
Considering the maps /i j- written in the basis given by the Fundamental Theorem of Persistent 
Homology, we say that x G Hp{Ki) (with a; 0) is born in Hp{Ki) if it is not in the image of 
fi-ij (i.e., /“L\ fix) = 0). For x G Hp{Ki) (with x 0), we say that x dies in Hp{Kj) li j > i 
is the smallest index for which fij{x) = 0. The lifetime of x is represented by the half-open 
interval [i,j). If fijix) 0 for all j > i in /, we say that x lives forever, and its lifetime is 
represented by the interval [i, 00 ). 


Remark 5 Note that some references (e.g., introduce persistent homology by defining the 

birth and death of generators without using the existence of a choice of compatible bases, as given 
by the Fundamental Theorem of Persistent Homology. The definition of birth coincides with the 

■^A pair ({MHig/, : Mi Mj}i^j), where (/,<) is a totally ordered set, such that for each i, we have 
that Mi is a vector space and the maps (j>ij are linear maps satisfying the functoriality property Q, is usually 
called a persistence module. With this terminology, the homology of a filtered simplicial complex is an example 
of persistence module. 

^Although the collection of intervals is unique, note that one has to choose a vertical order when drawing the 
intervals in the diagram, and there is therefore an ambiguity in the representation of the intervals as a barcode. 
However, there is no ambiguity when representing the intervals as points in a persistence diagram (see Fig. Rd)). 
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Figure 5: Example of persistent homology for a finite filtered simplicial complex, (a) We start with 
a finite filtered simplicial complex, (b) At each filtration step i, we draw as many vertices as the 
dimension of (left column) Ho{Ki) and (right column) Hi{Ki). We label the vertices by basis elements, 
the existence of which is guaranteed by the Fundamental Theorem of Persistent Homology, and we draw 
an edge between two vertices to represent the maps fij, as explained in the main text. We thus obtain 
a well-defined collection of disjoint half-open intervals called a “barcode.” We interpret each interval 
in degree p as representing the lifetime of a p-homology class across the filtration, (c) We rewrite the 
diagram in (b) in the conventional way. We represent classes that are born but do not die at the 
final filtration step using arrows that start at the birth of that feature and point to the right, (d) An 
alternative graphical way to represent barcodes (which gives exactly the same information) is to use 

persistence diagrams, in which an interval [i,j) is represented by the point {i,j) in the extended plane 

_ 2 _ _ 2 

R , where R = R U {cio}. Therefore, a persistence diagram is a finite multiset of points in R . We use 

squares to signify the classes that do not die at the final step of a filtration, and the size of dots or 
squares is directly proportional to the number of points being represented. For technical reasons, which 
we discuss briefly in Section 5.4 one also adds points on the diagonal to the persistence diagrams. (Each 
of the points on the diagonal has infinite multiplicity.) 
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definition that we have given, but the definition of death is different. One says that x S Hp{Ki) 
(with X 0) dies in Hp{Kj) if j > i is the smallest index for which either fij{x) = 0 or there 
exists y G Hp(Kii) with i' < i such that fi>,j{y) = fij{x). In words, this means that x and y 
merge at filtration step j, and the class that was born earlier is the one that survives. In the 
literature, this is called the elder rule. We do not adopt this definition, because the elder rule is 
not well-defined when two classes are born at the same time, as there is no way to choose which 
class will survive. For example, in Fig. there are two classes in Hq that are born at the same 
stage in Ki. These two classes merge in K 2 , but neither dies. The class that dies is [a] + [c]. 


5Tp^[l04p^ and the 


There are numerous excellent introductions to PH, such as the books 
papers 22 49 50 61 . For a brief and friendly introduction to PH and some of its applications, 
see the video https://www.youtube. com/watch?v=hObnGlWavag, For a brief introduction to 
some of the ideas in TDA, see the video https ://www.youtube. com/watch?v=XfWibrh6stw 


5 Computation of PH for data 

We summarize the pipeline for the computation of PH from data in Fig. In the following 
subsections, we describe each step of this pipeline and state-of-the-art algorithms for the compu¬ 
tation of PH. The two features that make PH appealing for applications are that it is computable 
via linear algebra and that it is stable with respect to perturbations in the measurement of data. 
In Section [ 57 ^ we give a brief overview of stability results. 



Figure 6: PH pipeline. 


5.1 Data 

As we mentioned in Section [l] types of data sets that one can study with PH include finite metric 
spaces, digital images, and networks. We now give a brief overview of how one can study these 
types of data sets using PH. 


5.1.1 Networks 


One can construe an undirected network as a 1-dimensional simplicial complex. If the network 
is weighted, then filtering by increasing or decreasing weight yields a filtered 1-dimensional 
simplicial complex. To obtain more refined information about the network, it is desirable to 
construct higher-dimensional simplices. There are various methods to do this. The simplest 
method, called a weight rank clique filtration (WRCF), consists of building a clique complex on 
each subnetwork. (See Section 5.2.1 for the definition of “clique complex.”) We refer the reader 
to 108 for an application of this method. Another method to study networks with PH consists 


of mapping the nodes of the network to points of a finite metric space. There are several ways to 
compute distances between nodes of a network; the method that we use in our benchmarking in 
Section consists of computing a shortest path between nodes. For such a distance to be well- 
defined, note that one needs the network to be connected (although conventionally one takes 
the distance between nodes in different components to be infinity). There are many methods 
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to associate an unfiltered simplicial complex to both undirected and directed networks. See the 
book 72 for an overview of such methods, and see the paper 71 for an overview of PH for 


networks. 


5.1.2 Digital images 

As we mentioned in Section digital images have a natural cubical structure: 2-dimensional 
digital images are made of pixels, and 3-dimensional images are made of voxels. Therefore, to 
study digital images, cubical complexes are more appropriate than simplicial complexes. Roughly, 
cubical complexes are spaces built from a union of vertices, edges, squares, cubes, and so on. 
One can compute PH for cubical complexes in a similar way as for simplicial complexes, and we 
will therefore not discuss this further in this paper. See 76 for a treatment of computational 


homology with cubical complexes rather than simplicial complexes and for a discussion of the 


relationship between simplicial and cubical homology. See 126 for an efficient algorithm and data 
structure for the computation of PH for cubical data, and 10 for an algorithm that computes 


PH for cubical data in an approximate way. For an application of PH and cubical complexes to 
movies, see 


101 


Other approaches for studying digital images are also useful. In general, given a digital image 
that consists of N pixels or voxels, one can consider this image as a point in a c x iV-dimensional 
space, with each coordinate storing a vector of length c representing the color of a pixel or voxel. 
Defining an appropriate distance function on such a space allows one to consider a collection of 
images (each of which has N pixels or voxels) as a finite metric space. A version of this approach 
was used in [^, in which the local structure of natural images was studied by selecting 3x3 
patches of pixels of the images. 


5.1.3 Finite metric spaces 

As we mentioned in the previous two subsections, both undirected networks and image data can 
be construed as finite metric spaces. Therefore, methods to study finite metric spaces with PH 
apply to the study of networks and image data sets. 

In some applications, points of a metric space have associated “weights.” For instance, in the 
study of molecules, one can represent a molecule as a union of balls in Euclidean space [131 132 


For such data sets, one would therefore also consider a minimum filtration value (see Section 5.2 
for the description of such Hltration values) at which the point enters the filtration. In Table 
we indicate which software libraries implement this feature. 

5.2 Filtered Simplicial Complexes 


In Section 3.2 we introduced the Cech complex, a classical simplicial complex from algebraic 


topology. However, there are many other simplicial complexes that are better suited for studying 
data from applications. We discuss them in this section. 

To be a useful tool for the study of data, a simplicial complex has to satisfy some theoretical 
properties dictated by topological inference; roughly, if we build the simplicial complex on a set 
of points sampled from a space, then the homology of the simplicial complex has to approximate 
the homology of the space. For the Cech complex, these properties are guaranteed by the 
Nerve Theorem. Some of the complexes that we discuss in this subsection are motivated by a 
“sparsification paradigm”: they approximate the PH of known simplicial complexes but have 
fewer simplices than them. Others, like the Vietoris-Rips complex, are appealing because they 
can be computed efficiently. In this subsection, we also review reduction techniques, which are 
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heuristics that reduce the size of complexes without changing the PH. In Table we summarize 
the simplicial complexes that we discuss in this subsection. 

For the rest of this subsection (X, d) denotes a metric space, and S' is a subset of X, which 
becomes a metric space with the induced metric. In applications, S is the collection of measure¬ 
ments together with a notion of distance, and we assume that S lies in the (unknown) metric 
space X. Our goal is then to compute persistent homology for a sequence of nested spaces 
Sej, 3^2 j • ■ •, Se,, where each space gives a “thickening” of S in X. 


5.2.1 Vietoris Rips complex 

We have seen that one of the disadvantages of the Cech complex is that one has to check for a 
large number of intersections. To circumvent this issue, one can instead consider the Vietoris- 
Rips (VR) complex, which approximates the Cech complex. For a non-negative real number e, 
the Vietoris-Rips complex VRe(S') at scale e is defined as 


VRe(S') = {cr C 5” I d(a:, y) < 2e for all x,y £ a} . 


The sense in which the VR complex approximates the Cech complex is that, when S' is a 
subset of Euclidean space, we have C'e(S) C VRc(S) C C^^{S). Deciding whether a subset 
(7 C S is in VRe(S) is equivalent to deciding if the maximal pairwise distance between any two 
vertices in a is at most 2e. Therefore, one can construct the VR complex in two steps. One first 
computes the e-neighborhood graph of S. This is the graph whose vertices are all points in S and 
whose edges are 

{{ij) € S X S \ i ^ j and d{i,j) < 2e}. 


Second, one obtains the VR complex by computing the clique complex of the e-neighborhood 
graph. The clique complex of a graph is a simplicial complex that is defined as follows: The 
subset {xq, ..., Xk} is a A:-simplex if and only if every pair of vertices in {xq, ..., Xk} is connected 
by an edge. Such a collection of vertices is called a clique. This construction makes it very easy 
to compute the VR complex, because to construct the clique complex one has only to check for 
pairwise distances — for this reason clique complexes are also called “lazy” in the literature. 
Unfortunately, the VR complex has the same worst-case complexity as the Cech complex. In the 
worst case, it can have up to 2l‘^l — 1 simplices and dimension jS”! — 1. 

In applications, one therefore usually only computes the VR complex up to some dimension 
A: |S'| — 1. In our benchmarking, we often choose k = 2 and fc = 3. 

The paper 134 overviews different algorithms to perform both of the steps for the construc- 


complex by the VR complex, see 51 


, see 

51 

125 

51 

; see 

79 


For a proof of the approximation of the Cech 


79 for a generalization of this result. 


5.2.2 The Delaunay complex 

To avoid the computational problems of the Cech and VR complexes, we need a way to limit 
the number of simplices in high dimensions. The Delaunay complex gives a geometric tool to 
accomplish this task, and most of the new simplicial complexes that have been introduced for 
the study of data are based on variations of the Delaunay complex. The Delaunay complex and 
its dual, the Voronoi diagram, are central objects of study in computational geometry because 
they have many useful properties. 

For the Delaunay complex, one usually considers X = so we also make this assumption. 
We subdivide the space into regions of points that are closest to any of the points in S. More 
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precisely, for any s G S, we define 

14 = {ic € I d(x, s) < d(a:, s') for all s' G S} . 

The collection of sets 14 is a cover for that is called the Voronoi decomposition of with 
respect to S, and the nerve of this cover is called the Delaunay complex of S and is denoted 
by Del(S';K‘^). In general, the Delaunay complex does not have a geometric realization in 
However, if the points S are “in general position” |^then the Delaunay complex has a geometric 
realization in that gives a triangulation of the convex hull of S. In this case, the Delaunay 
complex is also called the Delaunay triangulation. 

The complexity of the Delaunay complex depends on the dimension d of the space. For 
d < 2, the best algorithms have complexity 0{N\ogN), where N is the cardinality of S. For 
d > 3, they have complexity The construction of the Delaunay complex is therefore 

costly in high dimensions, although there are efficient algorithms for the computation of the 
Delaunay complex for d = 2 and d = 3. Developing efficient algorithms for the construction 
of the Delaunay complex in higher dimensions is a subject of ongoing research. See 14 for a 
discussion of progress in this direction, and see 


65 for more details on the Delaunay complex 


and the Voronoi diagram. 


5.2.3 Alpha complex 


We continue to assume that S' is a finite set of points in Using the Voronoi decomposition, 
one can define a simplicial complex that is similar to the Cech complex, but which has the desired 
property that (if the points S are in general position) its dimension is at most that of the space. 
Let e > 0, and let S^ denote the union i?(s, e). For every s G S, consider the intersection 
14 n B{s, e). The collection of these sets forms a cover of Se, and the nerve complex of this cover 
is called the alpha (a) complex of S at scale e and is denoted by ^^(S). The Nerve Theorem 
applies, and it therefore follows that Ae(S') has the same homology as S^. 

Furthermore, A^oiS) is the Delaunay complex; and for e < oo, the alpha complex is a 
subcomplex of the Delaunay complex. The alpha complex was introduced for points in the plane 

and for Euclidean spaces of arbitrary dimension 


52 


48 


, in 3-dimensional Euclidean space in 55 


For points in the plane, there is a well-known speed-up for the alpha complex that uses a 


duality between 0-dimensional and I-dimensional persistence for alpha complexes (50| 
for the algorithm, and see 


(See 84 


83 for an implementation.) 


5.2.4 Witness complexes 

Witness complexes are very useful for analyzing large data sets, because they make it possible 
to construct a simplicial complex on a significantly smaller subset L C S oi points that are 
called “landmark” points. Meanwhile, because one uses information about all points in S to 
construct the simplicial complex, the points in S are called “witnesses.” Witness complexes 
can be construed as a “weak version” of Delaunay complexes. (See the characterization of the 
Delaunay complex in [^.) 

Definition 6 Let {S, d) be a metric space, and let L C S be a finite subset. Suppose that a is 
a non-empty subset of L. We then say that s G S is a weak witness for a with respect to L if 

®A set S of points in is in general position if no d -|- 2 points of S lie on a d-dimensional sphere, and for 
any d' < d, no d' -|- 2 points of S lie on a d'-dimensional subspace that is isometric to M'* . In particular, a set of 
points S in is in general position if no four points lie on a 2-dimensional sphere and no three points lie on a 
line. 
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and only if d(s, a) < d(s, b) for all a G a and for all b € L \ a. The weak Delaunay complex 
Del“(L; S) of S with respect to L has vertex set given by the points in L, and a subset a of L is 
in Del’"(L; S) if and only if it has a weak witness in S. 

To obtain nested complexes, one can extend the definition of witnesses to e-witnesses. 

Definition 7 A point s € S is a weak e-witness for a with respect to L if and only if d{s,a) < 
d(s, 6) -|- e for all a G a and for all b G L\a. 

Now we can define the weak Delaunay complex Del“(L;S', e) at scale e to be the simplicial 
complex with vertex set L, and such that a subset cr C L is in Del“'(L; S', e) if and only if it has a 
weak e-witness in S. By considering different values for the parameter e, we thereby obtain nested 
simplicial complexes. The weak Delaunay complex is also called the “weak witness complex” or 
just the “witness complex” in the literature. 

There is a modification of the witness complex called the lazy witness complex X, e). 

It is a clique complex, and it can therefore be computed more efficiently than the witness complex. 
The lazy witness complex has the same 1-skeleton as Del“'(L; X, e), and one adds a simplex a to 
DeV{f^^{L-,X,e) whenever its edges are in X, e). Another type of modification of the 

witness complex yields parametrized witness complexes. Let v = 1,2,... and for all s G S' define 
m^{s) to be the distance to the vth. closest landmark point. Furthermore, define too(s) = 0 
for all s G S. Let Wy(L;S, e) be the simplicial complex whose vertex set is L and such that a 
1-simplex a = {xq, Xi\ is in W^(L; X^ e) if and only if there exists s in S for which 

max{d(a;o, s), d(a;i, s)} < m,j{s) -I- e. 

A simplex a is in Wy(L; X, e) if and only if all of its edges belong to W^{L-,X,e). For v = 2, 
note that W 2 (L; X, e) = X, e). For = 0, we have that Wo(L; X, e) approximates the 

VR complex VR(L; e). That is, 

Wo(L; X, e) C VR(L; 2e) C Wo(L; AT, 2e). 


Note that parametrized witness complexes are often called 
literature, because they are clique complexes. 

The weak Delaunay complex was introduced in 
were introduced in 


“lazy witness complexes” in the 


37 , and parametrized witness complexes 


38 . Witness complexes can be rather useful for applications. Because their 


complexity depends on the number of landmark points, one can reduce complexity by computing 
simplicial complexes using a smaller number of vertices. However, there are theoretical guaran¬ 
tees for the witness complex only when S is the metric space associated to a low-dimensional 
Euclidean submanifold. It has been shown that witness complexes can be used to recover the 
topology of curves and surfaces in Euclidean space [4 67 , but they can fail to recover topology 


for submanifolds of Euclidean space of three or more dimensions 15 . Consequently, there have 


been studies of simplicial complexes that are similar to the witness complexes but with better 
theoretical guarantees. One of them is the “graph-induced complex” (see Section 5.2.5). 


5.2.5 Additional complexes 

Many more complexes have been introduced for the fast computation of PH for large data sets. 
These include the graph-induced complex 44 , which is a simplicial complex constructed on a 
subsample Q, and has better theoretical guarantees than the witness complex (see 74 for the 


companion software); an approxi matio n of the VR complex that has a worst -cas e size that is 
linear in the number of data points 


113 


an approximation of the Cech complex 


79 


whose worst- 


case size also scales linearly in the data; and an approximation of the VR complex via simplicial 
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Table 1: We summarize several types of complexes that are used for PH. We indicate the theoretical 
guarantees and the worst-case sizes of the complexes as functions of the cardinality N of the vertex set. 
For the witness complexes (see Section 5.2.4 1 , L denotes the set of landmark points, while Q denotes the 
subsample set for the graph-induced complex (see Section 5.2.51. 


Complex K 

Size of K 

Theoretical guarantee 

Cech 

QpW) 

Nerve theorem 

Vietoris-Rips (VR) 


Approximates Cech complex 

Alpha (a) 

ArO(|d/2|) 

(TV points in R'*) 

Nerve theorem 

Witness 


For curves and surfaces in Euclidean space 

Graph-induced complex 

20(|g|) 

Approximates VR complex 

Sparsified Cech 

C>(TV) 

Approximates Cech complex 

Sparsified VR 

0{N) 

Approximates VR complex 


collapses [^. We do not discuss such complexes in detail, because thus far (at the time of 
writing) none of them have been implemented in publicly-available libraries for the computation 
of PH. (SeeTable|^ in Sectionj^for information about which complexes have been implemented.) 


5.2.6 Reduction techniques 

Thus far, we have discussed techniques to build simplicial complexes with possibly “few” sim- 
plices. One can also take an alternative approach to speed up the computation of PH. For 
example, one can use a heuristic (i.e., a method without theoretical guarantees on the speed-up) 
to reduce the size of a filtered complex while leaving the PH unchanged. 


For simplicial complexes, one such method is based on discrete Morse theory 111 , which was 
adapted to filtrations of simplicial complexes in 96 . The basic idea of the algorithm developed 


in [96] is that one can compute a partial matching of the simplices in a filtered simplicial complex 
so that (i) pairs occur only between simplices that enter the filtration at the same step, (ii) 
unpaired simplices determine the homology, and (iii) one can remove paired simplices from the 
filtered complex without altering the total PH. Such deletions are examples of the elementary 
simplicial collapses that we mentioned in Section [3.1| Unfortunately, the problem of finding an 
optimal partial matching was shown to be NP complete 73 , and one thus relies on heuristics to 
find partial matchings to reduce the size of the complex. 

A method for the reduction of the size of a complex for clique complexes, such as the VR 
complex, was introduced in 135 and is called the tidy-set method. Using maximal cliques. 


this method extracts a minimal representation of the graph that determines the clique complex. 
Although the tidy-set method cannot be extended to filtered complexes, it can be used for the 
computation of zigzag PH (see Section]^ |136| . The tidy-set method is a heuristic, because it 
does not give a guarantee to minimize the size of the output complex. 


5.3 From a Filtered Simplicial Complex to Barcodes 

To compute the PH of a filtered simplicial complex K and obtain a barcode like the one illustrated 
in Fig. [^c), we need to associate to it a matrix — the so-called boundary matrix — that stores 
information about the faces of every simplex. To do this, we place a total ordering on the 
simplices of the complex that is compatible with the filtration in the following sense: 

• a face of a simplex precedes the simplex; 
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• a simplex in the ith complex Ki precedes simplices in Kj for j > i, which are not in K^. 

Let n denote the total number of simplices in the complex, and let cti ,..., denote the simplices 
with respect to this ordering. We construct a square matrix S of dimension n x n by storing a 
1 in 6{i,j) if the simplex ai is a face of simplex aj of codimension 1; otherwise, we store a 0 in 

HiJ)- 

Once one has constructed the boundary matrix, one has to reduce it using Gaussian elimi- 
nationj^ In the following subsections, we discuss several algorithms for reducing the boundary 
matrix. 


5.3.1 Standard algorithm 


The so-called standard algorithm for the computation of PH was introduced for the held F 2 
in 53 and for general helds in [137 . For every j G_[l, ■ • •, n}, we dehne low(j) to be the largest 


index value i such that S{i,j) is different from Oj^ If column j only contains 0 entries, then 
the value of low(j) is undehned. We say that the boundary matrix is reduced if the map low is 
injective on its domain of dehnition. In Alg. we illustrate the standard algorithm for reducing 
the boundary matrix. Because this algorithm operates on columns of the matrix from left to 
right, it is also sometimes called the “column algorithm.” In the worst case, the complexity of 
the standard algorithm is cubic in the number of simplices. 


Algorithm 1 The standard algorithm for the reduction of the boundary matrix to barcodes. 

for j = 1 to n do 

while there exists i < j with low( 2 ) = low(j') do 
add column i to column j 

end while 
end for 


5 . 3.2 Reading off the intervals 

Once the boundary matrix is reduced, one can read off the intervals of the barcode by pairing 
the simplices in the following way: 

• If low(j) = i, then the simplex aj is paired with ai, and the entrance of ai in the filtration 
causes the birth of a feature that dies with the entrance of aj. 

• If low(j) is undefined, then the entrance of the simplex aj in the filtration causes the birth 
of a feature. It there exists k such that low(fc) = j, then aj is paired with the simplex ak, 
whose entrance in the filtration causes the death of the feature. If no such k exists, then 
aj is unpaired. 

A pair {ai,aj) gives the half-open interval [dg(cri), dg(crj)) in the barcode, where for a simplex 
a € K we define dg(cr) to be the smallest number I such that a £ Ki. An unpaired simplex au 
gives the infinite interval [dg(crfc), 00 ). We give an example of PH computation in Fig. 

^As we mentioned in Section [^ for the reduction of the boundary matri x an d thus the computation of PH, it 
is crucial that one uses simplicial homology with coefficients in a field; see [137] for details. 

®This map is called “low” in the literature, because one can think of it as indicating the index of the “lowest” 
row — the one that is nearest to the bottom of the page on which one writes the boundary matrix — that contains 
a 1 in column j. 
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(a) A filtered simplicial complex: 
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(b) We put a total order on the simplices that is compatible with the filtration: 
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where cr^ denotes the ith simplex in this order. 


(c) (Left) The boundary m atrix B for the filtered simplicial complex in (a) with respect 


to order on simplices in (b) and (right) its reduction B given by applying Alg. (one 


first adds column 5 to column 6 , and then column 4 to column 6 ): 
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(d) We read off the following intervals from the matrix B in (c) 

• cTi is positive, unpaired; this gives the interval [1,cxd) in Bg. 

• a 2 is positive, paired with 174 ; this gives no interval, because tT 2 and <74 enter at 
the same time in the filtration. 

• cr 3 is positive, paired with < 75 : this gives the interval [2,3) in Bq. 

• (76 is positive, paired with ( 77 : this gives the interval [3,4) in Bi. 


Figure 7: Example of PH computation with the standard algorithm (see Alg.nl 
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5.3.3 Other algorithms 


After the introduction of the standard algorithm, several new algorithms were developed. Each 
of these algorithms gives the same output for the computation of PH, so we only give a brief 
overview and references to these algorithms, as one does not need to know them to compute 


PH with one of the publicly-available software packages. In Section 7.2 we indicate which 
implementation of these libraries is best suited to which data set. 

As we mentioned in Section [5.3. 1[ in the worst case, the standard algorithm has cubic com¬ 
plexity in the number of simplices. This bound is sharp, as Morozov gave an example of a 
complex with cubic complexity in 98 . Note that in cases such as when matrices are sparse. 


complexity is less than cubic. Milosavljevic, Morozov, and Skraba 95 introduced an algorithm 


for the reduction of the boundary matrix in (!l(n“), where lo is the matrix-multiplication co¬ 
efficient (i.e., 0{n‘^) is the complexity of the multiplication of two square matrices of size n). 
At present, the best bound for w is 2.376 [^. Many other algorithms have been proposed for 
the reduction of the boundary matrix. These algorithms give a heuristic speed-up for many 
data sets and complexes (see the benchmarkings in the forthcoming references), but they still 
have cubic complexity in the number of simplices. Sequential algorithms include the twist 
algorithm and the dual algorithm 40|[^ . (Note that the dual algorithm is known to give 
a speed-up when one computes PH with the VR complex, but not necessarily for other types 
of complexes (see also the results of our benchmarking for the vertebra data set in the SI).) 
Parallel algorithms in a shared setting include the spectral-sequence algorithm 51 Section 
VH.4] and the chunk algorithm [^, and parallel algorithms in a distributed setting include the 
distributed algorithm [^. The multifield algorithm is a sequential algorithm that allows 
the simultaneous computation of PH over several fields pb] . 


5.4 Statistical interpretation of topological snmmaries 

Once we have obtained the barcodes, we need to interpret the results of our computations. In 
applications, one often wants to compare the output of a computation for a certain data set with 
the output for a null model. Alternatively, one may be studying data sets from the output of 
a generative model (e.g., many realizations from a model of random networks), and it is then 
necessary to average results over multiple realizations. In the first instance, one needs both a 
way to compare the two different outputs and a way to evaluate the significance of the result 
for the original data set. In the second case, one needs a way to calculate appropriate averages 
(e.g., summary statistics) of the result of the computations. 

If one wants to use PH in applications, one thus needs a reliable way to apply statistical 
methods to the output of the computation of PH. To our knowledge, statistical methods for 
PH were addressed for the first time in the paper 20 . Roughly, there are two basic types of 


current approaches to this problem. In the first approach, one studies properties of a metric space 
whose points are persistence diagrams, while in the second approach one studies “features” of 
persistence diagrams. 

More precisely, in the first approach one considers an appropriately defined “space of persis¬ 
tence diagrams,” defines a distance function on it, studies geometric properties of this space, and 
does standard statistical calculations (means, medians, statistical tests, and so on). Recall that 
a persistence diagram (see Fig. [^for an example) is a multiset of points in and that it gives 
the same information as a barcode. We now give the following precise definition of a persistence 
diagram. 


Definition 8 A persistence diagram is a multiset that is the union of a finite multiset of points 
in R with the multiset of points on the diagonal A = {(x,y) G R^ | a; = y}, where each point on 
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the diagonal has infinite multiplicity. 

In this definition, we include all of the points on the diagonal in with infinite multiplicity. 
We include the diagonal points for technical reasons; roughly, it is desirable to be able to compare 
persistence diagrams by studying bijections between their points, and persistence diagrams must 
thus be sets with the same cardinality. 

Given two persistence diagrams X and Y, we consider the following general definition of 
distance between X and Y. 


Definition 9 Let p G [l,oo]. The pth Wasserstein distance between X andY is defined as 


Wp[d]{X,Y) 


i/p 


inf 


(p: X^Y 


d[x,(l){x)]P 

.xGX 


for 1 < p < oo and as 

Woo[d](X, y) := inf sup d[a;, (/)(a;)] 

4’- 

for p = oo, where d is a metric on and (p ranges over all bijections from X to Y. 


Usually, one takes d = Lp for 1 < p < oo. One of the most commonly employed distance 
functions is the bottleneck distance Woo[Uoo]- 

The development of statistical analysis on the space of persistence diagrams is an area of 
ongoing research, and at present there are few tools that can be used in applications. We refer 
the reader to the publications 94,99 122| for research in this direction. 

The second approach for the development of statistical tools for PH consists of mapping the 
space of persistence diagrams to spaces (e.g., Banach spaces) that are amenable to statistical 
analysis. Such methods include persistence landscapes (see also the tutorial 19 ), the space 
of algebraic functions [^, and persistence images [^. See the papers 80 117 for applications 
of persistence landscapes. 

The library Dionysus implements the computation of the bottleneck and Wasserstein dis¬ 
tance (for d = Loo), while the Persistence Landscape Toolbox and the TDA Package 
[58| implement many tools for the statistical interpretation of persistent homology. 


5.5 Stability 

As we mentioned in Section PH is useful for applications because it is stable with respect to 
small perturbations in the input data. 

The first stability theorem for PH, proven in 


32 , asserts that, under favorable conditions. 


step (2) in the pipeline in Fig. is 1-Lipschitz with respect to suitable distance functions on 
filtered complexes and the bottleneck distance for barcodes (see Section 5.4). This result was 
generalized in the papers [TspTpT] . Stability for PH is an active area of research; for an overview 
of stability results, their history and recent developments, see |104[ Chapter 3]. 


6 Excursus: Generalized persistence 

One can use the algorithms that we described in Section to compute PH when one has a 
sequence of complexes with inclusion maps that are all going in the same direction, as in the 
following diagram: 

• • • —> Ki-i —>■ Ki —> Kij^i —>.... 
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An algorithm, called the zigzag algorithm, for the computation of PH for inclusion maps that 
do not all go in the same direction, as, e.g., in the diagram 


K,. 




i+l 


was introduced in 23 . In the more general setting in which maps are not inclusions, one can 


still compute PH using the simplicial map algorithm 45 


One may also wish to vary two or more parameters instead of one. This yields multi-filtered 
simplicial complexes, as, e.g., in the following diagram: 



t 


t 


t 


-> 








t 


t 


t 


-> 

Ayy—1 







t 


t 


t 










t 


T 


r 



In this case, one speaks of multi-parameter persistent homology. Unfortunately, the Fundamental 
Theorem of Persistent Homology is no longer valid if one filters with more than one parameter, 
and there is no such thing as a “generalized interval.” The topic of multi-parameter persistence 
is under active research, and several approaches are being studied to extract topological infor¬ 
mation from multi-filtered simplicial complexes. See 25 104| for the theory of multi-parameter 
persistent homology, and see (and for its companion paper) for upcoming software for 
the visualization of 2-parameter persistent homology. 


7 Software 


There are several publicly-available implementations for the computation of PH. We give an 
overview of the libraries with accompanying peer-reviewed publication and summarize their prop¬ 
erties in Table 12] 

The software package javaPlex 118 , which was developed by the computational topology 
group at Stanford University, is based on the Plex library 107 , which to our knowledge is the 


first piece of software to implement the computation of PH. Perseus [100 was developed to im¬ 


plement Morse-theoretic reductions 96 


(see Section 5.2.6). jHoles 12 is a Java library for com- 

[108] ■ Dionysus 


IS 


puting the weight rank clique filtration for weighted undirected networks 
the first software package to implement the dual algorithm 40 41 . PHAT is a library that 


implements several algorithms and data structures for the fast computation of barcodes, takes a 
boundary matrix as input, and is the first software to implement a matrix-reduction algorithm 
that can be executed in parallel. DIPHA |^, a spin-off of PHAT, implements a distributed 
computation of the matrix-reduction algorithm. Gudhi [93] implements new data structures for 
simplicial complexes and the boundary matrix. It also implements the multi-field algorithm, 
which allows simultaneous computation of PH over several fields 16 . This library is currently 


under intense development. The library ripser [^, the most recently developed software of the 
set that we examine, uses several optimizations and shortcuts to speed up the computation of PH 
with the VR complex. This library does not have an accompanying peer-reviewed publication. 



























24 


However, because it is currently the best-performing (both in terms of memory usage and in 
terms of wall-time secondt]^ library for the computation of PH with the VR complex, we in¬ 
clude it in our study. The library SimpPers 75 implements the simplicial map algorithm. 


Libraries that implement techniques for the statistical interpretation of barcodes include the 
TDA Package 


58 and the Persistence Landscape Toolbox 47 


RIVET, a package for 
We summarize 
and we discuss 


visualizing 2-parameter persistent homology, is slated to be released soon [891 
the properties of the libraries that we mentioned in this paragraph in Table 
the performance for a selection of them in Section [7.1.3| and in the SI. For a list of programs, see 
https;//github.com/n-otter/PH-roadmap 


7.1 Benchmarking 

We benchmark a subset of the currently available open-source libraries with peer-reviewed pub¬ 
lication for the computation of PH. To our knowledge, the published open-source libraries are 
jHoles, javaPlex, Perseus, Dionysus, PHAT, DIPHA, SimpPers, and Gudhi. To these, 
we add the library ripser, which is currently the best-performing library to compute PH with 
the VR complex. To study the performance of the packages, we restrict our attention to the 
algorithms that are implemented by the largest number of libraries. These are the VR complex 
and the standard and dual algorithms for the reduction of the boundary matrix. PHAT only 
takes a boundary matrix as input, so it is not possible to conduct a direct comparison of it with 
the other implementations. However, the fast data structures and algorithms implemented in 
PHAT are also implemented in its spin-off software DIPHA, which we include in the bench¬ 
marking. The software jHoles computes PH using the WRCF for weighted undirected networks, 
and SimpPers takes a map of simplicial complexes as input, so these two libraries cannot be 
compared directly to the other libraries. In the SI, we report benchmarking of some additional 
features that are implemented by some of the six libraries (i.e., javaPlex, Perseus, Dionysus, 
DIPHA, Gudhi, and ripser) that we test. Specifically, we report results for the computation of 
PH with cubical complexes for image data sets and the computation of PH with witness, alpha, 
and Cech complexes. 

We study the software packages javaPlex, Perseus, Dionysus, DIPHA, Gudhi, and 
RIPSER using both synthetic and real-world data from three different perspectives: 

1. Performance measured in CPU seconds and wall-time (i.e., elapsed time) seconds. 

2. Memory required by the process. 

3. Maximum size of simplicial complex allowed by the software. 


7.1.1 Data sets 


In this subsection, we describe the data sets that we use for our benchmarking. We use data sets 
from a variety of different mathematical and scientific areas and applications. In each case, when 
possible, we use data sets that have already been studied using PH. Our list of data sets is far 
from complete; we view this list as an initial step towards building a comprehensive collection of 
benchmarking data sets for PH. 

Data sets (l) -(^are synthetic: these arise from topology m stochastic topology P)l dy¬ 
namical systems 1(3) [ and from an area at the intersection of network theory and fractal geometry 
m and which was first used to study connection patterns of the cerebral cortex (see below for 
details). Data sets (5)|- (12) are from empirical experiments and measurements: they arise from 


‘Wall time” is the amount of elapsed time perceived by a human. 
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phylogenetics [(5)}|(6) [ genomics | (9) [ neuroscience m image analysis |( 7) [ medical imaging |(10)[ 
political science (11) and scientometrics [(T^ 

In each case, these data sets are of one of the following three types: point clouds, weighted 
undirected networks, and grey-scale digital images. To obtain a point cloud from a real-world 
weighted undirected network, we compute shortest paths using the inverse of the nonzero weights 
on edges as distances between nodes (except for the US Congress networks; see below). For the 
synthetic networks, the values assigned to edges are interpreted as distances between nodes, 
and we therefore use these values to compute shortest paths. We make all processed ver¬ 
sions of the data sets that we use in the benchmarking available at https://github.com/ 
n-otter/PH-roadmap/tree/master/data_sets, We provide the scripts that we used to produce 
the synthetic data sets at https://github.com/n-otter/PH-roadmap/tree/master/matlab/ 
synthetic_data_sets_scripts 

We now describe all data sets in detail: 


(1) Klein bottle. The Klein bottle is a one-sided nonorientable surface (see Fig. [^. We linearly 
sample points from the Klein bottle using its “figure-8” immersion in and size sample 
of 400 points. We denote this data set by Klein. Note that the image of the immersion of 
the Klein bottle does not have the same homotopy type as the original Klein bottle, but 
they do have the same singular homologjj^ with F2 coefficients. We have Hq{B) = F2, 
Hi{B) = F2 © F2, and H 2 {B) = F2, where B denotes the Klein bottle and Hi{B) is the 
ith singular homology group with F2 coefficients. 


( 2 ) 


Random VR complexes (uniform distribution) [77| . The parameters for this model are 
positive integers N and d; the random VR complex for parameters N and d is the VR 
complex VRe(V), where V is a set of N points sampled from (Equivalently, the 
random VR complex is the clique complex on the random geometric graph G{N,e) [105] .) 
We sample N points uniformly at random from [0,1]“^. We choose {N, d) = (50,16) and 
we denote this data set by random. The homology of random VR complexes was studied 


77 


( 3 ) 


Vicsek biological aggregation model, 
studied using PH in |121 


This model was first introduced in 124 and was 


We implement the model in the form in which it appears 
121 . The model describes the motion of a collection of particles that interact in a 


square with periodic boundary conditions. The parameters for the model are the length I 
of the side of the square, the initial angle Oq, the fixed absolute value for the velocity vq, 
the number of particles N, a noise parameter 77, and the number T of time steps. The 
output of the model is a point cloud in 3-dimensional Euclidean space in which each point 
is specified by its position in the 2-dimensional box and its velocity angle. We run three 
simulations of the model using the parameter values used in 121 . For each simulation, we 

for further 


choose two point clouds that correspond to two different time frames. See 
details. We denote this data set by Vicsek. 


121 


(4) Fractal networks. These are self-similar networks introduced in 116 to investigate whether 
connection patterns of the cerebral cortex are arranged in self-similar patterns. The pa¬ 
rameters for this model are natural numbers b, k, and n. To generate a fractal network, 
one starts with a fully-connected network on 2^ nodes. Two copies of this network are 
connected to each other so that the “connection density” between them is k~^, where the 
connection density is the number of edges between the two copies divided by the number of 


Singular homology is a method that assigns to every topological space homology groups encoding invariants 
of the space, in an ana logous way as simplicial homology assigns homology groups to simplicial complexes. We 
refer the reader to |69| for an account of singular homology. 
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total possible edges between them. Two copies of the resulting network are connected with 
connection density k~‘^. One repeats this type of connection process until the network has 
size 2", but with a decrease in the connnection density by a factor of 1/A: at each step. 

We define distances between nodes in two different ways: (1) uniformly at random, and 
(2) with linear weight-degree correlations. In the latter, the distance between nodes i and 
j is distributed as kikjX, where ki is the degree of node i and X is a random variable 
uniformly distributed on the unit interval. We use the parameters b = 5, n = 9, and k = 2; 
and we compute PH for the weighted network and for the network in which all adjacent 
nodes have distance 1. We denote this data set by fract and distinguish between the two 
ways of defining distances between weights using the abbreviations “r” for random, and 
“1” for linear. 


(5) Genomic sequences of the HIV virus. We construct a finite metric space using the indepen¬ 
dent and concatenated sequences of the three largest genes — gag, pol, and env — of the 
HIV genome. We take 1088 different genomic sequences and compute distances between 
them by using the Hamming distance. We use the aligned sequences studied using PH 
in [^. (The authors of that paper retrieved the sequences from [^.) We denote this data 
set by HIV. 

(6) Genomic sequences of H3N2. These are 1000 different genomic sequences of H3N2 influenza. 
We compute the Hamming distance between sequences. We use the aligned sequences 
studied using PH in [^. We denote this data set by H3N2. 

(7) Stanford Dragon graphic. We sample points uniformly at random from 3-dimensional scans 
of the dragon 86 , whose reconstruction we show in Fig.[^ The sample sizes contain 1000 
and 2000 points. We denote these data sets by drag 1 and drag 2, respectively. 


(8) C. elegans neuronal network 108 . A weighted, undirected network in which each node is 
a neuron and edges represent synapses or gaps junctions. We denote this data set by eleg. 

(9) Human genome. A weighted, undirected network representing a sample of the human 
genome. We use the network studied using PH in [108| . (The authors of that paper created 


the sample using data retrieved from 36 .) Each node represents a gene, and weighted edges 


between nodes represent the correlation of the expression level of two genes. We denote 
this data set by genome. 

(10) Grey-scale image: 3-dimensional rotational angiography scan of a head with an aneurysm. 
This data set was used in the benchmarking in . This data set is given by a 3-dimensional 
array of size 512 x 512 x 512, with each entry storing an integer that represents the grey 
value for the corresponding voxel. We retrieved the data set from the repository [^. We 
denote this data set by vertebra. 

(11) US Gongress roll-call voting networks. These two networks (the Senate and House of Rep¬ 
resentatives from the 104th United States Congress) are constructed using the procedure 
in 128 from data compiled by [110 . In each network, a node is a legislator (Senators 
in one data set and Representatives in the other), and there is a weighted edge between 
legislators i and j, where the weight Wij is a number in [0,1] (it is equal to 0 if and only if 
legislators i and j never voted the same way on any bill) given by the number of times the 
two legislators voted in the same way divided by the total number of bills on which they 
both voted. See 128 for further details. We denote the networks from the Senate and 


House by senate and house, respectively. The network senate has 103 nodes, and the 
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network house has 445 nodes. To compute shortest paths, we define the distance between 
two nodes i and j to be 1 — Wij. In the 104th Congress, no two politicians voted in the 
same way on every bill, so we do not have distinct nodes with 0 distance between them. 
(This is important, for example, if one wants to apply multidimensional scaling.) 


(12) Network of network scientists. This is a weighted undirected network representing the 
largest connected component of a collaboration network of network scientists 


102 . Nodes 


represent authors and edges represent collaborations, where weights indicate the number 
of joint papers. The largest connected components consists of 379 nodes. We denote this 
data set by netw-sc. 




Figure 8: Plot of the image of the figure-8 immersion of the Klein bottle and the reconstruction of the 
Stanford Dragon (retrieved from 


86 


7.1.2 Machines and compilers 

We tested the libraries on both a cluster and a shared-memory system. The cluster is a Dell 
Sandybridge cluster, it has 1728 (i.e., 108 x 16) cores of 2.0GHz Xeon SandyBridge, RAM of 
64 GiB in 80 nodes and RAM of 128 GiB in 4 nodes, and a scratch disk of 20 TB. It runs 
the operating system (OS) Red Hat Enterprise Linux 6. The shared-memory system is an IBM 
System x3550 M4 server with 16 (i.e., 2x8) cores of 3.3GHz, RAM of 768 GB, and storage of 
3 TB. It runs the OS Ubuntu 14.04.0lj^ The major difference in running shared algorithms 
on the shared-memory system versus the distributed-memory system is that each node in the 
former has much more available RAM than in the latter. (See also the difference in performance 
between computations on cluster and shared memory system in Tables and i) To compile 
Gudhi, DIPHA, Perseus and Dionysus we used the compiler gcc 4.8.2 on the cluster, and 
gcc 4.8.4 on the shared-memory system; for both machines we used the (default) optimization 
-03. Additionally, we used openmpi 1.8.3 for DIPHA. 

7.1.3 Tests and results 

We now report the details and results of the tests that we performed. We have made the data 
sets, header file to measure memory, and other information related to the tests available at 
https;//github. com/n-otter/PH-roadmap, Of the six software packages that we study, four 
implement the computation of the dual algorithm, and four implement the standard algorithm. 
It is reported in [m that javaPlex implements the dual algorithm, but the implementation of 
the algorithm has a bug and gives a wrong output. To our knowledge, this bug has not yet been 
fixed (at the time of writing), and we therefore test only the standard algorithm. 

For the computations on the cluster, we compare the libraries running both the dual algorithm 
and the standard algorithm. The package DIPHA is the only one to implement a distributed 

^^Note that we performed the computations for Gudhi and RIPSEH. at a different point in time, during which 
the shared-memory system was running the OS Ubuntu 16.04.01. 
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computation. As a default, we run the software on one node and 16 cores; we only increase 
the number of nodes and cores employed when the machine runs out of memory. However, 
augmenting the number of nodes can make the computations faster (in terms of CPU seconds) 
for complexes of all sizesj^ We see this in our experiments, and it is also discussed in [^. For 
the other packages, we run the computations on a single node with one core. 

For computations on the shared-memory system, we compare the libraries using only the 
dual algorithm if they implement it, and we otherwise use the standard algorithm. For the 
shared-memory system, we run all packages (including DIPHA) on a single core. 

In our benchmarking, we report mean computation times and memory measurements. In 
Table we give the computation times for the different software packages. We measure elapsed 
and CPU time by using the time function in Linux. We report computation times with a 
precision of one second; if a computation took only fractions of a second, we report “one second” 
as the computation time. For space reasons, we report results for a subset of the computations, 
and we refer the reader to the SI for a tabulation of the rest of our computations. In Table we 
report the memory used by the processes in terms of maximum resident set size (RSS); in other 
words, we give the maximum amount of real RAM a program has used during its execution. We 
measure the maximum RSS using the getrusage function in Linux. The header file that we use 
to measure memory is available at https;//github. com/n-otter/PH-roadmap, In DIPHA, 
the measurement of memory is already implemented by the authors of the software. They also 
use the getrusage function in Linux. The package javaPlex is written in Java, and we thus 
cannot measure its memory as we do for the other packages. However, one can infer memory 
requirements for this software package using the value of the maximal heap size necessary to 
perform the computations; we report this value in Table In Table we give the maximum 
size of the simplicial complex for which we were able to compute PH with each software package 
in our benchmarkings. 


7.2 Conclusions from our benchmarking 


Our tests suggest that RIPSER is the best-performing library currently available for the com¬ 
putation of PH with the Vietoris-Rips complex, and in order of decreasing performance, that 
Gudhi and DIPHA are the next-best implementations. For the computation of PH with cubical 
complexes, Gudhi outperforms DIPHA by a factor of 3 to 4 in terms of memory usage, and 
DIPHA outperforms Gudhi in terms of wall-time seconds by a factor of 1 to 2 (when running 
on one core on a shared-memory system). Both DIPHA and Gudhi significantly outperform 
the implementation in Perseus. For the computation of PH with the alpha complex, we did not 
observe any significant differences in performance between the libraries Gudhi and Dionysus. 
Because the alpha complex has fewer simplices than the other complexes that we used in our 
tests, further tests with larger data sets may be appropriate in future benchmarkings. 

There is a huge disparity between implementations of the dual and standard algorithms. 
In our benchmarking, the dual implementations outperformed standard ones both in terms of 
computation time (with respect to both CPU and wall-time seconds) and in terms of the amount 
of memory used. This significant difference in performance and memory usage was also revealed 
for the software package Dionysus in 40 . 

To conclude, in our benchmarking, the fastest software packages were RIPSER, Gudhi, and 
DIPHA. For small complexes, the software packages Perseus and javaPlex are good choices, 
because they are the easiest ones to use. (They are the only libraries that need only to be 


^^Based on the results of our tests, we think of small, medium, and large complexes, respectively, as complexes 
with a size of order of magnitude of up to 10 million simplices, between 10 million and 100 million simplices, and 
between 100 million and a billion simplices. 
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Table 3: Performance of the software packages measured in wall-time (i.e., elapsed time), and CPU 
seconds (for the computations running on the cluster). For each data set, we indicate the size of the 
simplicial complex and the maximum dimension up to which we construct the VR complex. For all data 
sets, we construct the filtered VR complex up to the maximum distance between any two points. We 
indicate the implementation of the standard algorithm using the abbreviation “st” following the name of 
the package, and we indicate the implementation of the dual algorithm using the abbreviation “d.” The 
symbol signifies that we were unable to hnish computations for this data set, because the machine ran 
out of memory. Perseus implements only the standard algorithm, and Gudhi and ripser implement 
only the dual algorithm. (a,b) We run DIPHA on one node and 16 cores for the data sets eleg, Klein, 
and genome; on 2 nodes of 16 cores for the HIV data set; on 2 and 3 nodes of 16 cores for the dual 
and standard implementations, respectively, for drag 2; and on 8 nodes of 16 cores for random. (The 
maximum number of processes that we could use at any one time was 128.) (c) We run DIPHA on a 
single core. 


(a) Computations on cluster: wall-time seconds 


Data set 

eleg 

Klein 

HIV 

drag 2 

random 

genome 

Size of complex 

4.4 X lO*^ 

1.1 X 10'' 

2.1 X 10“ 

1.3 X 10^ 

3.1 X 10^ 

4.5 X 10« 

Max. dim. 

2 

2 

2 

2 

8 

2 

javaPlex (st) 

84 

747 

- 

- 

- 

- 

Dionysus (st) 

474 

1830 

- 

- 

- 

- 

DIPHA (st) 

6 

90 

1631 

142559 

- 

9110 

Perseus 

543 

1978 

- 

- 

- 

- 

Dionysus (d) 

513 

145 

- 

- 

- 

- 

DIPHA (d) 

4 

6 

81 

2358 

5096 

232 

Gudhi 

36 

89 

1798 

14368 

- 

4753 

RIPSER 

1 

1 

2 

6 

349 

3 


(b) Computations on cluster: CPU seconds 


Data set 

eleg 

Klein 

HIV 

drag 2 

random 

genome 

Size of complex 

4.4 X 10'> 

1.1 X 10^ 

2.1 X 10“ 

1.3 X lO'-* 

3.1 X 10^ 

4.5 X 10“ 

Max. dim. 

2 

2 

2 

2 

8 

2 

javaPlex (st) 

284 

1031 

- 

- 

- 

- 

Dionysus (st) 

473 

1824 

- 

- 

- 

- 

DIPHA (st) 

68 

1360 

25950 

1489615 

- 

130972 

Perseus 

542 

1974 

- 

- 

- 

- 

Dionysus (d) 

513 

145 

- 

- 

- 

- 

DIPHA (d) 

39 

73 

1276 

37572 

79691 

3622 

Gudhi 

36 

88 

1794 

14351 

- 

4764 

RIPSER 

1 

1 

2 

5 

348 

2 


(c) Computations on shared-memory system: wall-time seconds 


Data set 

eleg 

Klein 

HIV 

drag 2 

genome 

fract r 

Size of complex 

3.2 X 10“ 

1.1 X 10'' 

2.1 X 10“ 

1.3 X lO'-* 

4.5 X 10“ 

2.8 X 10“ 

Max. dim. 

3 

2 

2 

2 

2 

3 

javaPlex (st) 

13607 

1358 

43861 

- 

28064 

- 

Perseus 

- 

1271 

- 

- 

- 

- 

Dionysus (d) 

- 

100 

142055 

35366 

- 

572764 

DIPHA (d) 

926 

13 

773 

4482 

1775 

3923 

Gudhi 

381 

6 

177 

1518 

442 

4590 

RIPSER 

2 

1 

2 

5 

3 

1517 
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Table 4: Memory usage in GB for the computations summarized in Table|^ For javaPlex, we indicate 
the value of the maximum heap size that was sufficient to perform the computation. The value that we 
give is an upper bound to the memory usage. For DIPHA, we indicate the maximum memory used by 
a single core (considering all cores). See Tablefor details on the number of cores used. 


(a) Computations on cluster 


Data set 

eleg 

Klein 

HIV 

drag 2 

random 

genome 

Size of complex 

4.4 X lO*" 

1.1 X 10' 

2.1 X 10“ 

1.3 X 10^ 

3.1 X 10^ 

4.5 X 10** 

Max. dim. 

2 

2 

2 

2 

8 

2 

javaPlex (st) 

< 5 

< 15 

> 64 

> 64 

> 64 

> 64 

Dionysus (st) 

1.3 

11.6 


- 

- 

- 

DIPHA (st) 

0.1 

0.2 

2.7 

4.9 

- 

4.8 

Perseus 

5.1 

12.7 

- 

- 

- 

- 

Dionysus (d) 

0.5 

1.1 

- 

- 

- 

- 

DIPHA (d) 

0.1 

0.2 

1.8 

13.8 

9.6 

6.3 

Gudhi 

0.2 

0.5 

8.5 

62.8 

- 

21.5 

RIPSER 

0.007 

0.02 

0.06 

0.2 

24.7 

0.07 


(b) Computations on shared-memory system 


Data set 

eleg 

Klein 

HIV 

drag 2 

genome 

fract r 

Size of complex 

3.2 X 10“ 

1.1 X 10' 

2.1 X 10“ 

1.3 X 10^ 

4.5 X 10* 

2.8 X 10^ 

Max. dim. 

3 

2 

2 

2 

2 

3 

javaPlex (st) 

< 600 

< 15 

< 700 

> 700 

< 700 

> 700 

Perseus 

- 

11.7 

- 

- 

- 

- 

Dionysus (d) 

- 

1.1 

16.8 

134.2 

- 

268.5 

DIPHA (d) 

31.2 

0.9 

17.7 

109.5 

37.3 

276.1 

Gudhi 

15.4 

0.5 

10.2 

62.8 

21.4 

134.8 

RIPSER 

0.2 

0.03 

0.07 

0.2 

0.7 

155 


Table 5: For each software package in (a), we indicate in (b) the maximal size of the simplicial complex 
supported by it thus far in our tests. 


(a) 

javaPlex 

Dionysus 

DIPHA 

Perseus 

Gudhi 

RIPSER 

st 

st 

d 

st 

d 

st 

d 

d 

(b) 

4.5 ■ 10» 

1.1 ■ 10^ 

2.8 X 10^ 

1.3- lO'^ 

CO 

o 

«: 

1 ■ 10' 

3.4 ■ 10^ 

3.4- 10^ 
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downloaded and are then “plug-and-play,” and they have user-friendly interfaces.) Because the 
library javaPlex implements the computation of a variety of complexes and algorithms, we feel 
that it is the best software for an initial foray into PH computation. 

We now give guidelines for the computation of PH based on our benchmarking. We list 
several types of data sets in Table and indicate which software or algorithm that we feel is 
best-suited to each one. These guidelines are based on the findings of our benchmarking. Note 
that one can transform networks into distance matrices, and distance matrices can yield points in 
Euclidean space using a method such as multi-dimensional scaling. Naturally, given a finite set 
of points in Euclidean space, we can compute their distance matrix. As we discussed in Section 
|5.1[ image data can also be considered as a finite metric space, so the indications for distance 
matrices and points in Euclidean space also apply to image data. 

Table 6: Guidelines for which implementation is best-suited for which data set, based on our bench¬ 
marking. Recall that we indicate the implementation of the dual algorithm using the abbreviation “d” 
following the name of a package, and similarly we indicate the implementation of the standard algorithm 
by “st”. Note that for smaller data sets one can also use javaPlex to compute PH with VR complexes 
from points in Euclidean space, and Perseus to compute PH with cubical complexes for image data, and 
with VR complexes for distance matrices. The library jHoles can only handle networks with density 
much less than 1. 


Data type 

Complexes 

Suggested software 

networks 

WRCF 

j Holes 

image data 

cubical 

Gudhi or DIPHA (st) 

distance matrix 

VR 

RIPSER 

distance matrix 

W 

javaPlex 

points in Euclidean space 

VR 

Gudhi 

points in Euclidean space 

C 

Dionysus 

points in Euclidean space 

a (only in dim 2 and 3) 

Dionysus ((st) in dim 2, (d) in dim 3) or Gudhi 


8 Future directions 


We conclude by discussing some future directions for the computation of PH. As we saw in 
Section]^ much work has been done on step 2 (i.e., going from filtered complexes to barcodes) 
of the PH pipeline of Fig. and there exist implementations of many fast algorithms for the 
reduction of the boundary matrix. Step 1 (i.e., going from data to a filtered complex) of the PH 
pipeline is an active area of research, but many sparsification techniques (see, e.g., [7^pl^) for 
complexes have yet to be implemented, and more research needs to be done on steps 1 and 3 
(i.e., interpreting barcodes; see. e.g., 17 20, 122| ) of the PH pipeline. 

We believe that there needs to be a community-wide effort to build a library that implements 
the algorithms and data structures for the computation of PH, and that it should be done in a 
way that new algorithms and methods can be implemented easily in this framework. This would 
parallel similar community-wide efforts in fields such as computational algebra and computational 

and CGAL 


geometry, and libraries such as Macaulay2 66 , Sage 42 
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We also believe that there is a need to create guidelines and benchmark data sets for the test 
of new algorithms and data structures. The methods and collection of data sets that we used in 
our benchmarking provide an initial step towards establishing such guidelines and a list of test 
problems. 
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